The following markdown file follows the exploratory data analysis performed on the employee information data set provided, titled CaseStudy2-data.csv. The purpose of this EDA is to predict Attrition and Monthly income based on the variables available within the data set. Any questions can be addressed to Chad Kwong at ckwong@smu.edu.
The markdown file starts with the initial loading/cleaning of the data and moves on to each aspect of the EDA. Each chunk of code is commented and a summary outline is provided before and after each chunk as well. Any questions can be addressed to Chad Kwong at ckwong@smu.edu.
# Initial Loading of data
AttritionData = read.csv("/Users/chadkwong/Desktop/SMU/GitHub/Case-Study-2/CaseStudy2-data.csv")
str(AttritionData)
## 'data.frame': 870 obs. of 36 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : chr "No" "No" "No" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Sales" ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : chr "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : int 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : chr "Male" "Male" "Male" "Female" ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : chr "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
## $ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : chr "Divorced" "Single" "Single" "Married" ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "No" "No" "No" "No" ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : int 3 3 3 3 3 4 3 3 3 3 ...
## $ RelationshipSatisfaction: int 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : int 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
# Cleaning and Altering data for initial investigation
AttritionData$Sorted_MonthlyIncome = cut(AttritionData$MonthlyIncome, breaks = c(0,4000,8000,12000,16000,20000), labels = c("Under 4k", "4k-8k", "8k-12k", "12k-16k", "16k+"))
AttritionData$Sorted_YearsAtCompany = cut(AttritionData$YearsAtCompany, breaks = c(-1,4,8,12,16,max(AttritionData$YearsAtCompany)), labels = c("Under 4","4-8","8-12","12-16","16+"))
AttritionData$Attrition = factor(AttritionData$Attrition)
AttritionData$JobRole = factor(AttritionData$JobRole)
AttritionData$TotalWorkingYears = factor(AttritionData$TotalWorkingYears)
AttritionData$YearsAtCompany = factor(AttritionData$YearsAtCompany)
AttritionData$YearsWithCurrManager = factor(AttritionData$YearsWithCurrManager)
yes = which(AttritionData$Attrition == "Yes")
AttritionYes = AttritionData[yes,]
table(is.na(AttritionData))
##
## FALSE
## 33060
No missing values were found in the data set.
The following code outlines my thought process around my initial investigation into predicting attrition based on 3 variables. This investigation was performed using empirical interpretation without supporting evidence.
# Box plot of Stock Option Level and Monthly Income based on all Yes responses to Attrition
StockIncomeBox = ggplot(data = AttritionYes, aes(x = StockOptionLevel, y = MonthlyIncome)) + geom_boxplot()
# plots exploring relationship between Stock Option Level and Monthly Income
StockHist = ggplot(AttritionData, aes(x = StockOptionLevel, fill = Sorted_MonthlyIncome)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
StockHist
StockIncomePoint = ggplot(data = AttritionData, aes(x = StockOptionLevel, y = MonthlyIncome, color = Sorted_MonthlyIncome)) + geom_point() + facet_wrap(~Attrition)
# exploring distribution of monthly income and relationship to job level
IncomeHist = ggplot(AttritionData, aes(x = Sorted_MonthlyIncome, fill = Sorted_MonthlyIncome)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
IncomeHist
SortedIncomePoint = ggplot(AttritionData, aes(x = JobLevel, y = MonthlyIncome, color = Sorted_MonthlyIncome)) + geom_point()
SortedIncomeBox = ggplot(AttritionData, aes(x = MonthlyIncome, y = JobLevel, color = Sorted_MonthlyIncome)) + geom_boxplot()
# Histogram of years at company divided by Attrition
YearHist = ggplot(AttritionData, aes(x = Sorted_YearsAtCompany, fill = Sorted_YearsAtCompany)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
YearHist
# Histogram of Education divided by Attrition
EducationHist = ggplot(AttritionData, aes(x = AttritionData$Education)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
# Histogram of Job Satisfaction divided by Attrition
JobSatisfHist = ggplot(AttritionData, aes(x = AttritionData$JobSatisfaction)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
# Histogram of Job Roles
JobRoleHist = ggplot(AttritionData, aes(x = AttritionData$JobRole)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
# Histogram of Environment Satisfaction
environmentsat = ggplot(AttritionData, aes(x = AttritionData$EnvironmentSatisfaction)) + geom_histogram(stat = "count") + facet_wrap(~Attrition)
With all the plots above in mind, I chose to further investigate Monthly Income, Stock Option level, and Years at Company.
I chose to attempt to use Learning Vector Quantization algorithm to compute the importance of variables within the attrition data set and use a repeated cross validation for verifying the model.
# Using LVQ to predict variables and then performing a cross validation. Importance of the variables are then computed and displayed on a graph
#creating and training the model
model <- train(Attrition~., data=AttritionData[,c(2,3,5,7,8,11,12,14,15,16,18,20,21,22,25,26,27,29,30,31,32,33,34,35,36)], method="lvq", preProcess="scale", trControl=trainControl(method="cv"))
# estimating variable importance using the model
importance <- varImp(model, scale=FALSE)
# summarizing the importance
print(importance)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 24)
##
## Importance
## MonthlyIncome 0.6567
## TotalWorkingYears 0.6563
## YearsAtCompany 0.6470
## StockOptionLevel 0.6455
## JobLevel 0.6406
## YearsInCurrentRole 0.6403
## YearsWithCurrManager 0.6291
## Age 0.6265
## JobInvolvement 0.6159
## JobSatisfaction 0.5833
## DistanceFromHome 0.5586
## EnvironmentSatisfaction 0.5532
## WorkLifeBalance 0.5491
## TrainingTimesLastYear 0.5428
## Education 0.5384
## NumCompaniesWorked 0.5354
## MonthlyRate 0.5335
## HourlyRate 0.5282
## DailyRate 0.5277
## RelationshipSatisfaction 0.5268
# plotting the importance
plot(importance, main = "Ranking of Variables")
The graph above predicts that the top 3 contributing variables are Monthly Income, Total Working Years, and Years at Company when predicting Attrition.
I chose to create a Naive Bayes model using the Monthly Income, Total Working Years, and Years at company. I then wanted to create a model using the best possible seed, so I calculated the best seed based off of the first 50 seeds. I then choose to compare the model using the predicted variables to the model created using my own chosen variables and then build the superior model with the optimum seed value.
## Variables to use: Monthly Income(c20)
## Considerable Variables: Stock Option Level(c29) Years at Company(c33), Total Working Years(c30)
# Testing Accuracy of top 3 predicted variables: Monthly Income, Total Working Years, Years at Company
nums = data.frame(accuracy = numeric(50), seed = numeric(50))
for(i in 1:50)
{
set.seed(i)
Indices = sample(seq(1:length(AttritionData$Attrition)),round(.70*length(AttritionData$Attrition)))
train = AttritionData[Indices,]
test = AttritionData[-Indices,]
train = upsample(train, cat_col = "Attrition")
train = upsample(train, cat_col = "Attrition")
model = naiveBayes(Attrition ~ MonthlyIncome + TotalWorkingYears + YearsAtCompany, data = train[,c(3,20,33,30)], laplace = 10)
predictions = predict(model, test)
CM = confusionMatrix(predictions,test$Attrition)
nums$accuracy[i] = CM$overall[1]
nums$seed[i] = i
nums$Sensitivity[i] = CM$byClass[1]
nums$Specificity[i] = CM$byClass[2]
}
# Displaying row with highest accuracy
nums[which(nums$accuracy == max(nums$accuracy)),]
## accuracy seed Sensitivity Specificity
## 20 0.7509579 20 0.7733333 0.6111111
# Displaying Confusion Matrix Statistics for all seeds 1-50
legendcolors <- c("Accuracy" = "#AB0E14", "Sensitivity" = "#F1B11B", "Specificity" = "#957531")
ggplot(nums) + geom_line(aes(x = seed, y = accuracy, color = "Accuracy")) + geom_line(aes(x = seed, y = Sensitivity, color = "Sensitivity")) + geom_line(aes(x = seed, y = Specificity, color = "Specificity")) + labs(x = "Seed", y ="Score", color = "Legend") + scale_color_manual(values = legendcolors) + ggtitle("Confusion Matrix Statistics for 50 Seeds")
# Testing Accuracy of variables chosen from initial investigation: Monthly Income, Stock Option Level, Years at Company
accs = data.frame(accuracy = numeric(50), seed = numeric(50))
for(i in 1:50)
{
set.seed(i)
Indices = sample(seq(1:length(AttritionData$Attrition)),round(.70*length(AttritionData$Attrition)))
train = AttritionData[Indices,]
test = AttritionData[-Indices,]
train = upsample(train, cat_col = "Attrition")
model = naiveBayes(Attrition ~ ., data = train[,c(3,20,33,29)])
predictions = predict(model, test)
CM = confusionMatrix(predictions,test$Attrition)
accs$accuracy[i] = CM$overall[1]
accs$seed[i] = i
accs$Sensitivity[i] = CM$byClass[1]
accs$Specificity[i] = CM$byClass[2]
}
# Displaying highest accuracy for comparison
accs[which(accs$accuracy == max(accs$accuracy)),]
## accuracy seed Sensitivity Specificity
## 20 0.697318 20 0.7200000 0.5555556
## 41 0.697318 41 0.7196262 0.5957447
## Model Using Monthly Income, Total Working Years, and Years at Company with seed 20
set.seed(20)
#creating training and test sets
Indices = sample(seq(1:length(AttritionData$Attrition)),round(.70*length(AttritionData$Attrition)))
train = AttritionData[Indices,]
test = AttritionData[-Indices,]
#upsampling Attrition column for model
train = upsample(train, cat_col = "Attrition")
train = upsample(train, cat_col = "Attrition")
model = naiveBayes(Attrition ~ ., data = train[,c(3,20,33,30)])
predictions = predict(model, test)
confusionMatrix(predictions,test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 175 14
## Yes 50 22
##
## Accuracy : 0.7548
## 95% CI : (0.698, 0.8057)
## No Information Rate : 0.8621
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2739
##
## Mcnemar's Test P-Value : 1.214e-05
##
## Sensitivity : 0.7778
## Specificity : 0.6111
## Pos Pred Value : 0.9259
## Neg Pred Value : 0.3056
## Prevalence : 0.8621
## Detection Rate : 0.6705
## Detection Prevalence : 0.7241
## Balanced Accuracy : 0.6944
##
## 'Positive' Class : No
##
# Writing attrition predictions to a new CSV file with the test data set
test$PredictedAttrition = predictions
AttritionPredictions = test[,c(1,3,20,30,33,39)]
write.csv(AttritionPredictions,"/Users/chadkwong/Desktop/SMU/GitHub/Case-Study-2/Attrition Predictions.csv", row.names = FALSE)
The above optimized Naive Bayes model provides an accuracy of ~75%, sensitivity of ~.7778, and specificity of ~.6111. With a small P-value of 1.2e-05, there is enough evidence to support the claim that Monthly Income, Years at Company, and Total Working Years are considerable variables when predicting Attrition.
When predicting monthly income, I chose to use the variables used in my previous model NB model. Using a linear regression model, I predicted the relationship between Monthly Income, Years at Company, and Total Working Years. I then perform a Leave One Out Cross Validation on the fit used and display the relationship between all 3 variables within a 3d scatter plot.
# Regression Model for Predicting monthly income
fit = lm(MonthlyIncome ~ TotalWorkingYears + YearsAtCompany, data = AttritionData)
summary(fit)
##
## Call:
## lm(formula = MonthlyIncome ~ TotalWorkingYears + YearsAtCompany,
## data = AttritionData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8819.4 -1414.2 -150.9 1162.4 10056.2
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1653.000 927.131 1.783 0.074978 .
## TotalWorkingYears1 659.540 1178.258 0.560 0.575801
## TotalWorkingYears2 971.921 1236.126 0.786 0.431945
## TotalWorkingYears3 1056.231 1168.557 0.904 0.366333
## TotalWorkingYears4 2309.748 1154.580 2.001 0.045782 *
## TotalWorkingYears5 2141.839 1128.046 1.899 0.057961 .
## TotalWorkingYears6 2652.052 1114.974 2.379 0.017613 *
## TotalWorkingYears7 2396.310 1126.791 2.127 0.033753 *
## TotalWorkingYears8 2418.744 1107.618 2.184 0.029271 *
## TotalWorkingYears9 4780.867 1132.237 4.222 2.69e-05 ***
## TotalWorkingYears10 4432.998 1107.570 4.002 6.85e-05 ***
## TotalWorkingYears11 4556.978 1226.265 3.716 0.000216 ***
## TotalWorkingYears12 4700.550 1197.884 3.924 9.46e-05 ***
## TotalWorkingYears13 4728.396 1184.442 3.992 7.15e-05 ***
## TotalWorkingYears14 4758.494 1215.498 3.915 9.82e-05 ***
## TotalWorkingYears15 6337.903 1179.512 5.373 1.01e-07 ***
## TotalWorkingYears16 7563.668 1207.858 6.262 6.19e-10 ***
## TotalWorkingYears17 5030.989 1236.381 4.069 5.19e-05 ***
## TotalWorkingYears18 6339.018 1285.335 4.932 9.91e-07 ***
## TotalWorkingYears19 4591.233 1293.715 3.549 0.000409 ***
## TotalWorkingYears20 5763.967 1273.193 4.527 6.89e-06 ***
## TotalWorkingYears21 15677.783 1313.646 11.935 < 2e-16 ***
## TotalWorkingYears22 14399.083 1264.673 11.386 < 2e-16 ***
## TotalWorkingYears23 12739.309 1301.595 9.787 < 2e-16 ***
## TotalWorkingYears24 13856.048 1434.186 9.661 < 2e-16 ***
## TotalWorkingYears25 13421.651 1607.448 8.350 2.99e-16 ***
## TotalWorkingYears26 17108.901 1431.587 11.951 < 2e-16 ***
## TotalWorkingYears27 17998.967 1866.429 9.644 < 2e-16 ***
## TotalWorkingYears28 13427.767 1500.291 8.950 < 2e-16 ***
## TotalWorkingYears29 14815.141 1721.993 8.603 < 2e-16 ***
## TotalWorkingYears30 12357.986 1573.939 7.852 1.32e-14 ***
## TotalWorkingYears31 11987.898 1724.372 6.952 7.47e-12 ***
## TotalWorkingYears32 13962.825 1722.533 8.106 1.95e-15 ***
## TotalWorkingYears33 14244.903 2340.343 6.087 1.79e-09 ***
## TotalWorkingYears34 17576.915 2698.136 6.514 1.29e-10 ***
## TotalWorkingYears35 16645.812 2060.670 8.078 2.42e-15 ***
## TotalWorkingYears36 17157.989 1683.709 10.191 < 2e-16 ***
## TotalWorkingYears37 13940.874 2916.218 4.780 2.08e-06 ***
## TotalWorkingYears40 8659.000 2622.322 3.302 0.001002 **
## YearsAtCompany1 -191.915 635.111 -0.302 0.762596
## YearsAtCompany2 -2.334 625.938 -0.004 0.997026
## YearsAtCompany3 -68.030 606.624 -0.112 0.910737
## YearsAtCompany4 -755.249 642.173 -1.176 0.239911
## YearsAtCompany5 -175.982 598.075 -0.294 0.768645
## YearsAtCompany6 403.990 667.890 0.605 0.545433
## YearsAtCompany7 -811.513 669.397 -1.212 0.225753
## YearsAtCompany8 478.493 654.639 0.731 0.465037
## YearsAtCompany9 -18.191 667.770 -0.027 0.978274
## YearsAtCompany10 -345.434 651.363 -0.530 0.596034
## YearsAtCompany11 -99.903 856.933 -0.117 0.907220
## YearsAtCompany12 -1369.154 1024.069 -1.337 0.181611
## YearsAtCompany13 682.476 803.520 0.849 0.395936
## YearsAtCompany14 1132.948 912.088 1.242 0.214545
## YearsAtCompany15 -2150.742 998.578 -2.154 0.031553 *
## YearsAtCompany16 697.126 1275.835 0.546 0.584938
## YearsAtCompany17 -1528.814 1291.647 -1.184 0.236916
## YearsAtCompany18 -138.046 1086.428 -0.127 0.898921
## YearsAtCompany19 -324.288 1139.371 -0.285 0.776009
## YearsAtCompany20 -2790.696 971.153 -2.874 0.004166 **
## YearsAtCompany21 -670.088 1266.528 -0.529 0.596901
## YearsAtCompany22 719.830 1013.225 0.710 0.477641
## YearsAtCompany23 -2138.651 2782.323 -0.769 0.442324
## YearsAtCompany24 -3845.496 1503.802 -2.557 0.010736 *
## YearsAtCompany25 1835.225 2054.584 0.893 0.372000
## YearsAtCompany26 -4386.670 1726.836 -2.540 0.011264 *
## YearsAtCompany30 -4370.825 2850.360 -1.533 0.125565
## YearsAtCompany31 5072.102 2263.269 2.241 0.025296 *
## YearsAtCompany32 1775.483 1985.653 0.894 0.371507
## YearsAtCompany33 -2441.915 3070.646 -0.795 0.426707
## YearsAtCompany40 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2453 on 801 degrees of freedom
## Multiple R-squared: 0.7376, Adjusted R-squared: 0.7154
## F-statistic: 33.12 on 68 and 801 DF, p-value: < 2.2e-16
preds = predict(fit)
# Performing LOO Cross Validation for both models
train(MonthlyIncome ~ TotalWorkingYears + YearsAtCompany, data = AttritionData, method = "lm", trControl = trainControl(method = "LOOCV"))
## Linear Regression
##
## 870 samples
## 2 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 869, 869, 869, 869, 869, 869, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2742.508 0.6473053 2004.609
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
surface_plot <- plot_ly(AttritionData,
x = ~TotalWorkingYears,
y = ~YearsAtCompany,
z = ~MonthlyIncome,
name = "Attrition Data",
type = "scatter3d",
mode = "markers",
size = .75)
# Adding Title to Graph and proper labels for axis
surface_plot <- surface_plot %>% layout(title = "Relationship Between Monthly Income, Years At Company, and Total Working Years", xaxis = list(title = 'Total Working Years',zeroline = TRUE), yaxis = list(title = 'Years at Company'), zaxis = list(title = 'Monthly Income'))
# Adding in predicted Monthly Income
surface_plot <-surface_plot %>% add_trace(z = ~preds, name = "Predicted Monthly Income", mode = "markers")
# Displaying Plot
surface_plot
Using the model, we can observe a RMSE of ~2743 which is within the threshold of 3000. A p-value of 2.2e-16 is observed as well. From the plot above, we can also see that the predicted monthly income falls within the distribution of the actual data set. I can thus conclude that the fit supports the relationship between Monthly Income, Years at Company, and Total Working Years. However, I speculate that there is not enough information to confirm that Monthly income is solely dependent on Years At Company and Total Years Working